## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.0 0.27 0.36 20.7 0.045
## 2 6.3 0.30 0.34 1.6 0.049
## 3 8.1 0.28 0.40 6.9 0.050
## 4 7.2 0.23 0.32 8.5 0.058
## 5 7.2 0.23 0.32 8.5 0.058
## 6 8.1 0.28 0.40 6.9 0.050
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 45 170 1.0010 3.00 0.45 8.8
## 2 14 132 0.9940 3.30 0.49 9.5
## 3 30 97 0.9951 3.26 0.44 10.1
## 4 47 186 0.9956 3.19 0.40 9.9
## 5 47 186 0.9956 3.19 0.40 9.9
## 6 30 97 0.9951 3.26 0.44 10.1
## quality
## 1 6
## 2 6
## 3 6
## 4 6
## 5 6
## 6 6
This data set describes white wines. It contains 4,898 wines, each with 11 variables quantifying its chemical properties, plus a quality score. The main question is: which chemical properties influence the quality of white wines? The 11 variables are fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol. Some of these variables may be correlated with one another.
## [1] 4898 12
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
The histogram shows that quality is roughly normally distributed, with a peak at 6.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
fixed.acidity ranges from 3.8 to 14.2, and the middle 50% of wines fall between 6.3 and 7.3. The distribution appears roughly normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
volatile.acidity ranges from 0.08 to 1.1, and the middle 50% of wines fall between 0.21 and 0.32. The distribution appears roughly normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
citric.acid ranges from 0 to 1.66, and the middle 50% of wines fall between 0.27 and 0.39. The distribution appears roughly normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
The first plot shows a right-skewed distribution, so I replotted it on a log scale. The transformed residual sugar distribution appears bimodal, with peaks at around 2 and around 10.
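The effect of a log transform on skew can also be checked numerically. The sketch below is only an illustration: it uses simulated log-normal values as a stand-in for residual sugar (not the wine data, so it will not reproduce the bimodal shape), and `skewness` is a hypothetical helper defined here, not a base R function.

```r
# Synthetic stand-in for a right-skewed variable such as residual sugar.
# These are simulated values, NOT the wine data.
set.seed(42)
sugar_like <- rlnorm(1000, meanlog = 1.5, sdlog = 0.8)

# Hypothetical helper: moment-based sample skewness.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_raw <- skewness(sugar_like)         # positive for right-skewed data
skew_log <- skewness(log10(sugar_like))  # closer to zero after the transform

# Bin counts on the log10 scale; use plot = TRUE to draw the histogram instead.
log_hist <- hist(log10(sugar_like), breaks = 30, plot = FALSE)

print(c(raw = skew_raw, log10 = skew_log))
```

A skewness near zero after the transform suggests the log scale is a reasonable choice for plotting; a value that stays clearly positive would match the "still slightly right-skewed" pattern.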
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
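chlorides ranges from 0.009 to 0.346, and the middle 50% of wines fall between 0.036 and 0.05. The bulk of the distribution appears roughly normal, although the maximum lies far above the third quartile, which points to a long right tail.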
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
free.sulfur.dioxide ranges from 2 to 289, and the middle 50% of wines fall between 23 and 46. The distribution appears roughly normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
total.sulfur.dioxide ranges from 9 to 440, and the middle 50% of wines fall between 108 and 167. The distribution appears roughly normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
density ranges from 0.9871 to 1.039, and most values fall between about 0.99 and 0.996.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
pH ranges from 2.72 to 3.82, and most values fall between about 3.0 and 3.3. The distribution appears roughly normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
sulphates ranges from 0.22 to 1.08, and most values fall between about 0.4 and 0.55.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
The first plot shows a right-skewed distribution, so I applied a log transform, but the result is still slightly right-skewed.
The mode is 9.4.
There are 4,898 white wines with 11 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol). All features are numeric. The median and mode of quality are both 6.
alcohol
Fixed acidity, citric acid, free sulfur dioxide, pH, and residual sugar
No
No
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.02269729 0.289180698
## volatile.acidity -0.02269729 1.00000000 -0.149471811
## citric.acid 0.28918070 -0.14947181 1.000000000
## residual.sugar 0.08902070 0.06428606 0.094211624
## chlorides 0.02308564 0.07051157 0.114364448
## free.sulfur.dioxide -0.04939586 -0.09701194 0.094077221
## total.sulfur.dioxide 0.09106976 0.08926050 0.121130798
## density 0.26533101 0.02711385 0.149502571
## pH -0.42585829 -0.03191537 -0.163748211
## sulphates -0.01714299 -0.03572815 0.062330940
## alcohol -0.12088112 0.06771794 -0.075728730
## quality -0.11366283 -0.19472297 -0.009209091
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.08902070 0.02308564 -0.0493958591
## volatile.acidity 0.06428606 0.07051157 -0.0970119393
## citric.acid 0.09421162 0.11436445 0.0940772210
## residual.sugar 1.00000000 0.08868454 0.2990983537
## chlorides 0.08868454 1.00000000 0.1013923521
## free.sulfur.dioxide 0.29909835 0.10139235 1.0000000000
## total.sulfur.dioxide 0.40143931 0.19891030 0.6155009650
## density 0.83896645 0.25721132 0.2942104109
## pH -0.19413345 -0.09043946 -0.0006177961
## sulphates -0.02666437 0.01676288 0.0592172458
## alcohol -0.45063122 -0.36018871 -0.2501039415
## quality -0.09757683 -0.20993441 0.0081580671
## total.sulfur.dioxide density pH
## fixed.acidity 0.091069756 0.26533101 -0.4258582910
## volatile.acidity 0.089260504 0.02711385 -0.0319153683
## citric.acid 0.121130798 0.14950257 -0.1637482114
## residual.sugar 0.401439311 0.83896645 -0.1941334540
## chlorides 0.198910300 0.25721132 -0.0904394560
## free.sulfur.dioxide 0.615500965 0.29421041 -0.0006177961
## total.sulfur.dioxide 1.000000000 0.52988132 0.0023209718
## density 0.529881324 1.00000000 -0.0935914935
## pH 0.002320972 -0.09359149 1.0000000000
## sulphates 0.134562367 0.07449315 0.1559514973
## alcohol -0.448892102 -0.78013762 0.1214320987
## quality -0.174737218 -0.30712331 0.0994272457
## sulphates alcohol quality
## fixed.acidity -0.01714299 -0.12088112 -0.113662831
## volatile.acidity -0.03572815 0.06771794 -0.194722969
## citric.acid 0.06233094 -0.07572873 -0.009209091
## residual.sugar -0.02666437 -0.45063122 -0.097576829
## chlorides 0.01676288 -0.36018871 -0.209934411
## free.sulfur.dioxide 0.05921725 -0.25010394 0.008158067
## total.sulfur.dioxide 0.13456237 -0.44889210 -0.174737218
## density 0.07449315 -0.78013762 -0.307123313
## pH 0.15595150 0.12143210 0.099427246
## sulphates 1.00000000 -0.01743277 0.053677877
## alcohol -0.01743277 1.00000000 0.435574715
## quality 0.05367788 0.43557472 1.000000000
From a subset of the data, alcohol and density seem to have stronger correlations with quality than the other features, but residual sugar and total sulfur dioxide are themselves correlated with alcohol and density. I want to look more closely at scatter plots of quality against variables such as alcohol, density, and residual sugar.
Comparing alcohol to quality, the first plot suffers from some overplotting. The moderate positive correlation seen in the earlier table (0.44) is easy to see here: wines with a higher alcohol percentage are more often rated high quality than wines with a lower alcohol percentage.
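To make the comparison of strengths explicit, the correlations with quality can be ranked by absolute value. The sketch below copies the quality column of the correlation matrix above into a named vector; the names `quality_cor` and `ranked` are illustrative.

```r
# Correlations with quality, copied from the quality column of the matrix above.
quality_cor <- c(
  fixed.acidity        = -0.113662831,
  volatile.acidity     = -0.194722969,
  citric.acid          = -0.009209091,
  residual.sugar       = -0.097576829,
  chlorides            = -0.209934411,
  free.sulfur.dioxide  =  0.008158067,
  total.sulfur.dioxide = -0.174737218,
  density              = -0.307123313,
  pH                   =  0.099427246,
  sulphates            =  0.053677877,
  alcohol              =  0.435574715
)

# Order by absolute strength, strongest first, keeping the sign.
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
print(round(ranked, 3))
```

Alcohol, density, and chlorides come out on top. With the full data frame loaded as `df`, the same vector could be taken from `cor(df)[, "quality"]` after dropping the `quality` entry itself.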
##
## Call:
## lm(formula = df$quality ~ df$alcohol)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5317 -0.5286 0.0012 0.4996 3.1579
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.582009 0.098008 26.34 <2e-16 ***
## df$alcohol 0.313469 0.009258 33.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7973 on 4896 degrees of freedom
## Multiple R-squared: 0.1897, Adjusted R-squared: 0.1896
## F-statistic: 1146 on 1 and 4896 DF, p-value: < 2.2e-16
Comparing density to quality, the first plot suffers from some overplotting. The negative correlation seen in the earlier table (-0.31) is easy to see here: wines with a lower density are more often rated high quality than wines with a higher density.
##
## Call:
## lm(formula = df$quality ~ df$density)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1441 -0.6258 0.0005 0.5162 4.2102
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 96.277 4.003 24.05 <2e-16 ***
## df$density -90.942 4.027 -22.58 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8429 on 4896 degrees of freedom
## Multiple R-squared: 0.09432, Adjusted R-squared: 0.09414
## F-statistic: 509.9 on 1 and 4896 DF, p-value: < 2.2e-16
Comparing total.sulfur.dioxide to quality, the first plot suffers from some overplotting. Most white wines have a total.sulfur.dioxide between 100 and 200 (no units), and the weak negative correlation seen in the earlier table (-0.17) is visible here.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
Comparing chlorides to quality, the first plot suffers from some overplotting. Most white wines have a chlorides value between 0.036 and 0.05, and the small negative correlation seen in the earlier table (-0.21) is easy to see here.
Comparing residual sugar to alcohol, the first plot suffers from some overplotting. The moderate negative correlation seen in the earlier table (-0.45) is easy to see here.
Comparing density to alcohol, the strong negative correlation seen in the earlier table (-0.78) is easy to see here.
Comparing total sulfur dioxide to alcohol, the moderate negative correlation seen in the earlier table (-0.45) is easy to see here.
This plot shows a clear positive relationship between residual sugar and density (0.84 in the earlier table).
Like residual sugar, total sulfur dioxide has a positive, though weaker, relationship with density (0.53).
Quality correlates most strongly with alcohol (0.44) and density (-0.31), although both correlations are moderate at best.
As alcohol percentage increases, the variance in quality increases. In the plot of quality vs. alcohol, there are horizontal bands: quality takes only integer values, so many white wines with different alcohol values share the same quality score. The relationship between quality and alcohol appears to be exponential rather than linear.
Based on the R^2 value, alcohol explains only about 19 percent of the variance in quality. Other features of interest can be incorporated into the model to explain more of the variance in quality.
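The R^2 values reported by the single-predictor models can be cross-checked against the correlation matrix, because in a linear regression with one predictor R^2 equals the squared Pearson correlation. A minimal check using the values printed earlier (the variable names are illustrative):

```r
# Pearson correlations with quality, taken from the correlation matrix.
r_alcohol <- 0.435574715
r_density <- -0.307123313

# In a one-predictor linear regression, R-squared is the squared correlation.
r2_alcohol <- r_alcohol^2
r2_density <- r_density^2

print(round(r2_alcohol, 4))   # compare with "Multiple R-squared: 0.1897"
print(signif(r2_density, 4))  # compare with "Multiple R-squared: 0.09432"
```

Both values agree with the `lm` summaries for quality against alcohol and quality against density.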
Alcohol correlates with several other features. The more alcohol a wine contains, the less water and sugar it has, so it makes sense that alcohol correlates negatively with density (-0.78) and residual sugar (-0.45).
The quality of a white wine is positively correlated with alcohol and negatively correlated with density. Chlorides and volatile acidity also correlate with quality, but less strongly than alcohol and density. Either alcohol or density could be used in a model to predict the quality of white wines; using both adds little, because the two are strongly (though not perfectly) correlated with each other (-0.78).
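How much alcohol and density overlap as predictors can be quantified with a variance inflation factor (VIF). This is a sketch rather than part of the original analysis: it relies on the fact that, with exactly two predictors, each predictor's VIF is 1 / (1 - r^2), where r is the correlation between them, taken here from the matrix above.

```r
# Correlation between alcohol and density, from the correlation matrix.
r_alcohol_density <- -0.78013762

# With exactly two predictors, each one's VIF is 1 / (1 - r^2).
shared_variance <- r_alcohol_density^2
vif <- 1 / (1 - shared_variance)

print(round(shared_variance, 3))
print(round(vif, 2))
```

The two variables share roughly 61% of their variance, which gives a VIF of about 2.6. That is well short of perfect collinearity, but it is enough overlap that the second predictor has little new information to add.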
This plot shows that residual sugar has a moderate negative relationship with alcohol, but there is no obvious relationship with quality.
##
## Call:
## lm(formula = df$quality ~ df$alcohol + df$density)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5670 -0.5242 -0.0003 0.4881 3.0898
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -22.49170 6.16503 -3.648 0.000267 ***
## df$alcohol 0.36036 0.01478 24.389 < 2e-16 ***
## df$density 24.72842 6.07937 4.068 4.82e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.796 on 4895 degrees of freedom
## Multiple R-squared: 0.1925, Adjusted R-squared: 0.1921
## F-statistic: 583.3 on 2 and 4895 DF, p-value: < 2.2e-16
As with residual sugar, density is related to alcohol, but there is no obvious relationship with quality.
##
## Call:
## lm(formula = df$quality ~ df$alcohol + df$chlorides)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5575 -0.5179 -0.0143 0.4913 3.1295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.861231 0.116382 24.585 < 2e-16 ***
## df$alcohol 0.297669 0.009906 30.051 < 2e-16 ***
## df$chlorides -2.470822 0.557945 -4.428 9.7e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7958 on 4895 degrees of freedom
## Multiple R-squared: 0.193, Adjusted R-squared: 0.1926
## F-statistic: 585.2 on 2 and 4895 DF, p-value: < 2.2e-16
I could not find any features that strengthen each other; there is an absence of strong correlations with quality.
Yes, almost none of the features has a strong relationship with the quality of white wines.
Yes, I created a linear model, starting with quality and alcohol.
The alcohol-only linear model accounts for 18.96% of the variance in the quality of white wines (adjusted R^2). Adding the density variable to the model raises the adjusted R^2 only slightly, to 19.2%.
Limitation: the model explains only a small share of the variance in quality. Strength: all the coefficients are statistically significant, which suggests that alcohol percentage and density are both associated with the quality of white wines.
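The comparison between the one-predictor and two-predictor models can be written as a nested-model comparison. The sketch below runs on simulated data, not the wine data; `wine_sim` and its coefficients are assumptions chosen only so that the code is self-contained. It also passes `data =` to `lm` instead of using `df$...` in the formula, which keeps the coefficient names short.

```r
# Synthetic stand-in for the wine data: simulated values, NOT the real data set.
set.seed(1)
n <- 500
alcohol <- runif(n, min = 8, max = 14)
density <- 1.03 - 0.0035 * alcohol + rnorm(n, sd = 0.001)
quality <- 2.5 + 0.3 * alcohol + rnorm(n, sd = 0.8)
wine_sim <- data.frame(quality, alcohol, density)

# Passing data = ... keeps coefficient names short (alcohol, not df$alcohol).
fit_alcohol <- lm(quality ~ alcohol, data = wine_sim)
fit_both <- update(fit_alcohol, . ~ . + density)

# R-squared cannot decrease when a predictor is added, so compare adjusted
# R-squared and run an F-test on the nested models.
adj_r2 <- c(
  alcohol_only = summary(fit_alcohol)$adj.r.squared,
  alcohol_density = summary(fit_both)$adj.r.squared
)
comparison <- anova(fit_alcohol, fit_both)

print(adj_r2)
print(comparison)
```

Applied to the real data, the same pattern would show whether the small gain from adding density is more than the automatic increase that any extra predictor brings.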
The distribution of white wine quality appears to be normal. The most common quality score is 6, in the middle of the scale.
White wines with a higher alcohol percentage tend to have higher quality.
The white wine data set contains information on 4,898 white wines across 12 variables. I started by understanding the individual variables in the data set, and then I explored interesting questions and leads as I continued to make observations on plots. Eventually, I explored the quality of white wines across many variables and created a linear model to predict white wine quality.
There was a blurry trend between a wine's density or alcohol percentage and its quality. I was surprised that volatile acidity and citric acid did not have a strong positive correlation with quality.
Some limitations of this model: I struggled to increase the R^2 of the model. Without finding any feature strongly related to quality, my model can explain only 19.2 percent of the variance in quality.